## Intel<sup>®</sup> oneAPI VTune<sup>™</sup> Profiler 2021.1.1 Gold

**Elapsed Time:** 0.194s

Application execution time is too short. Metrics data may be unreliable. Consider reducing the sampling interval or increasing your application execution time.
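To see why a run this short yields unreliable metrics, it helps to estimate how many samples the collector could have gathered. The sketch below is a rough illustration only: it treats the sampling interval as a fixed wall-clock period for a single busy thread, and the interval values are assumptions chosen for illustration, not settings taken from this report. The function name `expected_samples` is hypothetical.

```python
def expected_samples(elapsed_s: float, interval_ms: float) -> int:
    """Rough number of samples for one busy thread at a fixed sampling interval."""
    return round(elapsed_s * 1000.0 / interval_ms)


elapsed_s = 0.194  # Elapsed Time from this report

# Illustrative intervals (assumptions, not read from the report).
for interval_ms in (1.0, 0.1):
    count = expected_samples(elapsed_s, interval_ms)
    print(f"{interval_ms} ms interval -> about {count} samples")
```

With only a couple of hundred samples spread over dozens of multiplexed events, each individual metric rests on very few observations, which is why a shorter interval or a longer run is suggested.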

Clockticks: 453,780,000

Instructions Retired: 811,800,000

CPI Rate: 0.559

MUX Reliability: 0.988
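The CPI Rate is the ratio of Clockticks to Instructions Retired, so it can be reproduced from the two counters listed in this report. A minimal sketch using those counter values (the variable names are illustrative):

```python
clockticks = 453_780_000            # Clockticks from this report
instructions_retired = 811_800_000  # Instructions Retired from this report

cpi = clockticks / instructions_retired  # cycles per instruction
ipc = instructions_retired / clockticks  # instructions per cycle (inverse of CPI)

print(f"CPI = {cpi:.3f}, IPC = {ipc:.2f}")  # prints "CPI = 0.559, IPC = 1.79"
```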

**Retiring:** 48.3% of Pipeline Slots

Light Operations: 44.7% of Pipeline Slots
FP Arithmetic: 0.0% of uOps
FP x87: 0.0% of uOps
FP Scalar: 0.0% of uOps
FP Vector: 100.0% of uOps
Other: 100.0% of uOps

**Heavy Operations:** 3.6% of Pipeline Slots
**Microcode Sequencer:** 4.0% of Pipeline Slots
**Assists:** 0.0% of Pipeline Slots

Front-End Bound: 0.0% of Pipeline Slots
Front-End Latency: 9.8% of Pipeline Slots
ICache Misses: 1.2% of Clockticks
ITLB Overhead: 0.8% of Clockticks
Branch Resteers: 3.5% of Clockticks
Mispredicts Resteers:
Clears Resteers: 0.0% of Clockticks
Unknown Branches: 1.1% of Clockticks
DSB Switches: 2.4% of Clockticks
Length Changing Prefixes: 0.0% of Clockticks
MS Switches: 2.4% of Clockticks
Front-End Bandwidth: 5.8% of Pipeline Slots
Front-End Bandwidth MITE: 27.1% of Clockticks
Front-End Bandwidth DSB: 3.7% of Clockticks

(Info) DSB Coverage: 39.5%
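Metric lines of the form `Label: value% of unit` are regular enough to be read mechanically, which is convenient when comparing several runs. The following is a minimal sketch, assuming one metric per line with an optional `(Info)` prefix and optional markdown bold around the label; the names `METRIC_RE` and `parse_metrics` are hypothetical and not part of any VTune tooling.

```python
import re

# Matches e.g. "DSB Switches: 2.4% of Clockticks" or "(Info) DSB Coverage: 39.5%".
METRIC_RE = re.compile(
    r"^\s*(?:\(Info\)\s*)?\*{0,2}(?P<name>[^:*]+):\*{0,2}\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*%(?:\s+of\s+(?P<unit>.+?))?\s*$"
)


def parse_metrics(text: str) -> dict:
    """Map each metric label to a (percentage, unit) pair; unit is None if absent."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_RE.match(line)
        if match:
            name = match.group("name").strip()
            metrics[name] = (float(match.group("value")), match.group("unit"))
    return metrics


sample = """\
Clears Resteers: 0.0% of Clockticks
Unknown Branches: 1.1% of Clockticks
DSB Switches: 2.4% of Clockticks
Front-End Bandwidth: 5.8% of Pipeline Slots
(Info) DSB Coverage: 39.5%
"""

metrics = parse_metrics(sample)
for name, (value, unit) in metrics.items():
    print(f"{name}: {value} ({unit or 'no unit'})")
```

Lines that carry a label but no value are skipped rather than guessed.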

Bad Speculation: 6.2% of Pipeline Slots
Branch Mispredict: 6.2% of Pipeline Slots
Machine Clears: 0.0% of Pipeline Slots

Back-End Bound:

A significant portion of pipeline slots remains empty. When operations take too long in the back-end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work to be retired per cycle than the machine is capable of supporting. This opportunity cost results in slower execution. Long-latency operations like divides and memory operations can cause this, as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).

**Memory Bound:** 13.5% of Pipeline Slots
L1 Bound: 6.0% of Clockticks
**DTLB Overhead:** 1.0% of Clockticks
Load STLB Hit: 0.0% of Clockticks
Load STLB Miss: 1.0% of Clockticks
**Loads Blocked by Store Forwarding:** 0.8% of Clockticks
**Lock Latency:** 0.0% of Clockticks
**Split Loads:** 0.0% of Clockticks
4K Aliasing: 0.7% of Clockticks
FB Full: 0.0% of Clockticks
L2 Bound: 0.0% of Clockticks
L3 Bound: 2.4% of Clockticks
**Contested Accesses:** 0.0% of Clockticks
Data Sharing: 0.0% of Clockticks
L3 Latency: 5.4% of Clockticks
**SQ Full:** 0.0% of Clockticks
**DRAM Bound:** 3.6% of Clockticks
**Memory Bandwidth:** 6.0% of Clockticks
**Memory Latency:** 10.7% of Clockticks
**Store Bound:** 3.6% of Clockticks
**Store Latency:** 12.6% of Clockticks
False Sharing: 0.0% of Clockticks
**Split Stores:** 0.1% of Clockticks
**DTLB Store Overhead:** 4.1% of Clockticks
Store STLB Hit: 2.8% of Clockticks
Store STLB Miss: 1.3% of Clockticks

**Core Bound:** 16.4% of Pipeline Slots

This metric represents how much of a bottleneck non-memory Core issues were. Both a shortage of hardware compute resources and dependencies between the software's instructions are categorized under Core Bound. Hence it may indicate that the machine ran out of an OOO (out-of-order) resource, that certain execution units are overloaded, or that dependencies in the program's data or instruction flow are limiting performance (e.g. FP-chained long-latency arithmetic operations).
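The pipeline-slot figures in this report can be cross-checked against one another. In top-down analysis every pipeline slot falls into exactly one of four top-level categories (Retiring, Front-End Bound, Bad Speculation, Back-End Bound), and a category measured in pipeline slots is the sum of its pipeline-slot children. The sketch below is a consistency check under that assumption, using only values copied from this report; the variable names are illustrative. It derives Back-End Bound, whose own value is missing here, from Memory Bound and Core Bound, and derives Front-End Bound from Front-End Latency and Front-End Bandwidth.

```python
# Pipeline-slot percentages copied from this report.
retiring = 48.3
bad_speculation = 6.2
front_end_children = {"Front-End Latency": 9.8, "Front-End Bandwidth": 5.8}
back_end_children = {"Memory Bound": 13.5, "Core Bound": 16.4}

# Assumption: a parent measured in pipeline slots equals the sum of its children.
front_end_bound = sum(front_end_children.values())
back_end_bound = sum(back_end_children.values())

total = retiring + bad_speculation + front_end_bound + back_end_bound

print(f"Front-End Bound (derived): {front_end_bound:.1f}%")
print(f"Back-End Bound (derived):  {back_end_bound:.1f}%")
print(f"Sum of top-level categories: {total:.1f}%")
```

Under these assumptions the four categories add up to 100.0%, with Front-End Bound at 15.6% and Back-End Bound at 29.9%. The 0.0% figure associated with Front-End Bound in this copy does not match the sum of its children, so it is likely an extraction error rather than a measured value.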

**Divider:** 0.0% of Clockticks

**Port Utilization:** 18.8% of Clockticks

Cycles of 0 Ports Utilized: 16.6% of Clockticks
Serializing Operations: 9.5% of Clockticks
Mixing Vectors: 0.0% of uOps

Cycles of 1 Port Utilized: 6.2% of Clockticks
Cycles of 2 Ports Utilized: 8.0% of Clockticks
Cycles of 3+ Ports Utilized: 20.9% of Clockticks

ALU Operation Utilization: 28.9% of Clockticks
Port 0: 25.9% of Clockticks
Port 5: 28.3% of Clockticks
Port 6: 27.1% of Clockticks
Port 2: 27.1% of Clockticks
Port 3: 27.1% of Clockticks

Store Operation Utilization: 24.6% of Clockticks
Port 4: 24.6% of Clockticks
Port 7: 9.8% of Clockticks

**Vector Capacity Usage (FPU):** 0.0%

**Average CPU Frequency:** 2.565 GHz

**Total Thread Count:** 1

Paused Time: 0s

**Effective Physical Core Utilization:** 22.1% (0.883 out of 4)

The metric value is low, which may signal poor utilization of physical CPU cores caused by:

- load imbalance
- threading runtime overhead
- contended synchronization
- thread/process underutilization
- incorrect affinity that utilizes logical cores instead of physical cores

Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism or run the Locks and Waits analysis to identify parallel bottlenecks for other parallel runtimes.

**Effective Logical Core Utilization:** 11.4% (0.914 out of 8)

The metric value is low, which may signal poor utilization of logical CPU cores. Consider improving physical core utilization as the first step and then look at opportunities to utilize logical cores, which in some cases can improve processor throughput and overall performance of multi-threaded applications.
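Both utilization percentages follow directly from the figures in parentheses: the effective number of busy cores divided by the number of cores available. A minimal sketch reproducing them from the values in this report (the helper name `utilization_pct` is hypothetical):

```python
def utilization_pct(effective_cores: float, total_cores: int) -> float:
    """Share of the available cores that were effectively busy, in percent."""
    return 100.0 * effective_cores / total_cores


physical = utilization_pct(0.883, 4)  # Effective Physical Core Utilization
logical = utilization_pct(0.914, 8)   # Effective Logical Core Utilization

print(f"physical: {physical:.1f}%, logical: {logical:.1f}%")
# prints "physical: 22.1%, logical: 11.4%"
```

With a Total Thread Count of 1, the effective core count cannot exceed about one, so values near 0.9 out of 4 physical or 8 logical cores are what a single busy thread would produce.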

## **Collection and Platform Info:**

**Application Command Line:** ./codecs/hm/decoder/TAppDecoderStatic "-b" "./bin/hm/encoder\_lowdelay\_main.cfg/CLASS\_A/Kimono\_1920x1080\_24\_QP\_27\_hm.bin"

**User Name:** root

**Operating System:** 5.4.0-65-generic DISTRIB\_ID=Ubuntu DISTRIB\_RELEASE=18.04 DISTRIB\_CODENAME=bionic DISTRIB\_DESCRIPTION="Ubuntu 18.04.5 LTS"

**Computer Name:** eimon

**Result Size:** 13.6 MB

**Collection start time:** 09:44:28 10/02/2021 UTC

**Collection stop time:** 09:44:29 10/02/2021 UTC

**Collector Type:** Event-based sampling driver

CPU:

Name: Intel(R) Processor code named Kabylake ULX

**Frequency:** 1.992 GHz

**Logical CPU Count:** 8

**Cache Allocation Technology:** 

Level 2 capability: not detected

**Level 3 capability:** not detected